Wikipedia recommender system

Created by:

Table of contents

  1. Crawling and scraping
  2. Stemming, lemmatization
  3. Similarities
  4. Examples

Crawling and scraping

1. Selecting the topic of our dataset

We chose the James Bond universe and used the James Bond Fandom site to scrape the content of over 1500 pages into our dataset.

Function to get the list of sites with the desired content

We begin by scraping the list of sites with the desired content, which can be found here.
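The link-collection step could be sketched like this with Python's standard-library HTML parser (the function names and the `/wiki/` prefix are assumptions for illustration, not our exact code):

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects hrefs of in-wiki article links from one listing page."""

    def __init__(self, prefix="/wiki/"):
        super().__init__()
        self.prefix = prefix
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href") or ""
            # keep article links only; ":" filters out File:, Category:, etc.
            if href.startswith(self.prefix) and ":" not in href:
                self.links.add(href)


def extract_article_links(html):
    """Return the sorted list of article links found in one page's HTML."""
    parser = LinkCollector()
    parser.feed(html)
    return sorted(parser.links)
```

The real crawler would call `extract_article_links` on each downloaded listing page and accumulate the results.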

Total number of sites:

Function that creates a dict of content from the given sites

We take only paragraphs that contain no tags other than a, b, i, and span, because we observed that paragraphs with other tags usually hold useless information such as tables or image descriptions.
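A sketch of that paragraph filter using the standard-library parser (a simplified stand-in for our actual scraping code):

```python
from html.parser import HTMLParser

# the only tags we tolerate inside a paragraph
ALLOWED = {"a", "b", "i", "span"}


class ParagraphFilter(HTMLParser):
    """Keeps the text of <p> blocks whose only nested tags are a, b, i, span."""

    def __init__(self):
        super().__init__()
        self.in_p = False
        self.clean = True
        self.buf = []
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_p, self.clean, self.buf = True, True, []
        elif self.in_p and tag not in ALLOWED:
            self.clean = False  # paragraph contains a disallowed tag

    def handle_data(self, data):
        if self.in_p:
            self.buf.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self.in_p:
            if self.clean:
                self.paragraphs.append("".join(self.buf).strip())
            self.in_p = False
```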

Since downloading the page contents takes a long time, we save the generated dict so that subsequent runs of the program can reuse it.
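A minimal caching helper along those lines (pickle-based; the file name and function name are just examples):

```python
import os
import pickle


def load_or_build(build_fn, cache_path="site_contents.pkl"):
    """Return the cached dict if present; otherwise build it, save it, return it."""
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    data = build_fn()  # the slow scraping step, run only on the first launch
    with open(cache_path, "wb") as f:
        pickle.dump(data, f)
    return data
```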

Example of website content.

Convert the dict of sites & content to a DataFrame for preprocessing

Save results to CSV

Load from CSV file
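The dict-to-DataFrame conversion and the CSV round trip could look like this (function names are illustrative):

```python
import pandas as pd


def contents_to_df(site_contents):
    """Convert a {url: text} dict into a DataFrame, one row per site."""
    return pd.DataFrame(
        {"site": list(site_contents), "content": list(site_contents.values())}
    )


def save_and_reload(df, path="contents.csv"):
    """Round-trip the DataFrame through CSV so later runs can load it directly."""
    df.to_csv(path, index=False)
    return pd.read_csv(path)
```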

Stemming, lemmatization

Preprocessing our dataset

Our preprocessing of each site's content looks like this:

(for our data this process can take about 2 minutes)
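The pipeline above can be sketched as follows; note that the stemmer here is a deliberately simplified suffix-stripper standing in for a real algorithm such as Porter's, and the stopword list is only a tiny sample:

```python
import re

# tiny sample stopword list; the real one would be much longer
STOPWORDS = {"the", "a", "an", "is", "of", "in", "and", "to"}


def simple_stem(word):
    """Very simplified Porter-style suffix stripping (a sketch, not full Porter)."""
    for suffix in ("ing", "edly", "ed", "ly", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Lowercase, keep alphabetic tokens, drop stopwords, stem each token."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [simple_stem(t) for t in tokens if t not in STOPWORDS]
```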

Applying the stemming algorithm to all contents

Drop all sites with no text.

Create a bag-of-words model

Now we turn each string that appears in our lists into a separate column and store the number of occurrences of that string in each row's text.

Drop columns that contain almost no nonzero entries, as well as other unusable columns.

Also drop all duplicate rows, because many links point to the same page.
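The three steps above (count matrix, rare-column pruning, duplicate-row removal) might look like this with pandas (the `min_count` threshold is an assumed knob, not our exact criterion):

```python
from collections import Counter

import pandas as pd


def bag_of_words(token_lists, min_count=2):
    """Build a document-term count matrix; drop terms seen fewer than
    min_count times in total, then drop duplicate rows."""
    counts = [Counter(tokens) for tokens in token_lists]
    bow = pd.DataFrame(counts).fillna(0).astype(int)
    totals = bow.sum(axis=0)
    bow = bow.loc[:, totals[totals >= min_count].index]  # prune rare terms
    bow = bow.drop_duplicates()  # many links point to the same page
    return bow
```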

Most frequent words

This chart presents the most frequent words in our model.

Similarities

TF-IDF form

Let's start by creating our TF-IDF representation.

IDF needs a binary representation.

IDF

TF-IDF

Calculate the length of each vector.
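Putting the binary representation, IDF, TF-IDF, and vector lengths together (a sketch; the base-10 logarithm is an assumption):

```python
import numpy as np
import pandas as pd


def tf_idf(bow):
    """Compute TF-IDF from a document-term count matrix (rows: docs, cols: terms)."""
    binary = (bow > 0).astype(int)          # IDF needs binary presence
    n_docs = len(bow)
    doc_freq = binary.sum(axis=0)           # in how many docs each term appears
    idf = np.log10(n_docs / doc_freq)
    tfidf = bow * idf                       # raw term counts weighted by IDF
    lengths = np.sqrt((tfidf ** 2).sum(axis=1))  # vector length per document
    return tfidf, lengths
```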

Cosine similarity

Function that calculates the most similar pages for a list of pages.

We decided to treat all articles in the input query as one. To do this we take the centroid of all the articles (add all values and divide by the number of articles). Then for our new query q, we calculate the cosine similarity with each document d that is not one of our initial documents:

$\frac{\sum_{i=1}^{T} q_i \cdot d_i}{\sqrt{\sum_{i=1}^{T} q_i^2} \cdot \sqrt{\sum_{i=1}^{T} d_i^2}}$

where $T$ is the number of terms and $q_i$ and $d_i$ are the weights of term $i$ in $q$ and $d$ respectively.
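A sketch of this centroid-based similarity (the small epsilon guarding against zero-length vectors is an implementation detail added here, not part of the formula):

```python
import numpy as np


def most_similar(tfidf, query_rows, k=5):
    """Rank documents by cosine similarity to the centroid of the query rows,
    excluding the query rows themselves."""
    X = np.asarray(tfidf, dtype=float)
    q = X[query_rows].mean(axis=0)          # centroid of the query articles
    sims = X @ q / (np.linalg.norm(X, axis=1) * np.linalg.norm(q) + 1e-12)
    order = [int(i) for i in np.argsort(-sims) if i not in set(query_rows)]
    return order[:k]
```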

Examples

For 1 and 2 our model returns 1, 2, 3, 4, 5.

So for 2 sites about Bond actresses, it returns sites about other actresses.

For 1 and 2 our model returns 1, 2, 3, 4, 5.

So for 2 sites about BMWs, our model returns 3 sites about other BMWs and other cars.

For 1, 2, and 3 our model returns 1, 2, 3, 4, 5.

So for 3 deeply nerdy articles about the novels, it returns 4 articles in the same vein.

For the article about the movie Goldfinger, our model returns the article about Goldfinger himself, the movie's theme song, an actor from the movie, and a character from the Goldfinger novel.

Visualizations

To verify the correctness of our model (cosine similarity against the query average), let's compare it with classical pairwise cosine similarity, which is well established.

And we can observe that it works as it should! All our results stand out clearly from the other sites.

Now let's check which words have the biggest impact on choosing these sites.

First, from our TF-IDF matrix we select only the rows belonging to our query or to the sites returned by our model. Then we drop all columns that are zero across the whole query, and then drop all columns that are zero across all returned sites.

Again we combine our query into one row.

Then we create a new matrix holding the percentage share of each word in the choice of each site.
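The column filtering and per-word share computation described above might be sketched as follows (the function name and example values are illustrative):

```python
import pandas as pd


def word_impact(tfidf, query_rows, result_rows):
    """Percentage share of each term in the query-result similarity products."""
    sub = tfidf.iloc[query_rows + result_rows]
    n_q = len(query_rows)
    sub = sub.loc[:, (sub.iloc[:n_q] != 0).any()]   # term must occur in the query
    sub = sub.loc[:, (sub.iloc[n_q:] != 0).any()]   # ...and in some returned site
    q = sub.iloc[:n_q].mean(axis=0)                 # combine the query into one row
    contrib = sub.iloc[n_q:] * q                    # per-term product q_i * d_i
    return contrib.div(contrib.sum(axis=1), axis=0) * 100
```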

Now, for visualization purposes, we need to transform the DataFrame.
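One common transformation here is from wide to long form, e.g. with pandas `melt` (the example values below are made up purely for illustration):

```python
import pandas as pd

# share: rows are recommended sites, columns are words, values are % impact
share = pd.DataFrame({"bond": [60.0, 40.0], "car": [40.0, 60.0]},
                     index=["site_a", "site_b"])

# wide -> long form, which plotting libraries generally prefer
long_form = share.reset_index().rename(columns={"index": "site"}).melt(
    id_vars="site", var_name="word", value_name="impact")
```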

Voilà! Now we can easily see which words have the biggest impact on choosing a given site.